Motivation

Are you a broke college student? Have you ever wondered when the heck flight prices drop, and if there is such thing as an optimal time to splurge on tickets to see your long distance partner? This started off as an obsessive screenshotting extravaganza that involved opening Google Flights, searching up the desired flight, and copy + pasting the price history graphs of the exact same flight path on the same day of the week but also for the weeks before and after the desired date. That information still resides in my initial mess of a Google Docs file, but I’ve decided that it would be a shame of me as a data scientist and MS Candidate to keep eyeballing everything instead of just making my own graphs in R, so now I’ve regretfully taken it to the next level by fiddling with R code and making this a more official-looking data science project.

Methodology

I’ve been tracking flight prices from Newark Liberty International Airport (EWR) to Minneapolis-St. Paul International Airport (MSP) for the following Mondays: March 18, March 25, April 8, and April 15, 2024. Specifically, I’ve been logging (no, not the math definition) the lowest price for a one-way nonstop ticket for each day. Most of my data come from Google Flights, though I’ve noticed a few discrepencies in their data: a few days ago, the website highlighted Sun Country Airlines for having the alleged lowest price (on another flight I was looking at, but not any of the ones in this analysis), but they actually had already increased that original price from the previous day, so Google was wrong(!!). [3/5/24 Update: There also are a few other discrepencies between the Google Flight history and my own records near the end of Feburary, aka when I was able to look up the flights on United.com myself and just screenshot the fare calendar and record the numbers in my CSV files. However, I expect most of the past data to be accurate.]

The prices I’ve been tracking on the four departure days are each contained in their own CSV file. My desired flight is the one on April 8. I skipped April 1, since those prices seemed a lot higher overall, likely due to Easter weekend. Though I tracked April 15’s prices since it is the following week’s Monday, I didn’t end up using it, since Sun Country’s flight would have been the cheapest one, and they don’t have any flights for the other three Mondays in the dataset (whose flights are mainly operated by Delta and United Airlines). Sun Country is known for being a budget airline that offers flights to and from MSP, so I omitted April 15 from my analysis. (It’s still loaded in the data, though.)

Data Preprocessing

Oh, right! Where’s our data documentation? Here it is:

Column Name type Description
flight_date chr one of March 18, March 25, April 8, or April 15
days_before int number of days before the flight date
lowest_price int lowest price (USD) for a one-way nonstop ticket (contingent upon Google’s psychic powers that day)
day chr day of the week (here, it’s just Mondays)
origin chr origin airport (all EWR, nothing to see)
destination chr destination airport (all MSP, nothing to see)
sun_country chr whether Sun Country Airlines is the cheapest option for that day (yes or no)
date_recorded Date date the price was “observed”
# initialize data frame
d = NULL

# read in four files, one for each Monday tracked
dates <- c("03-18", "03-25", "04-08", "04-15")
filenames <- paste0(dates, "-2024_ewr_msp.csv")

for (f in filenames) { 
  d <- rbind(d, read.csv(f))
}

# convert date_recorded to date object for plotting purposes
d$date_recorded <- as.Date(d$date_recorded, "%m-%d-%Y")

# discard rows where sun_country is yes, and add label for plotly hover text
d <- d %>% 
  filter(sun_country == 'no') %>%
    mutate(txtlabel = paste0('Recorded: ', date_recorded,
                             '\nDays before: ', days_before,
                             '\n<b>Lowest price: $', lowest_price,'</b>')) 
d %>% 
  select(flight_date, days_before, lowest_price, date_recorded) %>%
  head()
##   flight_date days_before lowest_price date_recorded
## 1  03-18-2024          85          224    2023-12-24
## 2  03-18-2024          84          224    2023-12-25
## 3  03-18-2024          83          224    2023-12-26
## 4  03-18-2024          82          224    2023-12-27
## 5  03-18-2024          81          224    2023-12-28
## 6  03-18-2024          80          224    2023-12-29

For the record, I didn’t actually start recording dates until four days ago (February 19), but Google Flights price history can go back as far as 60 days. Let’s just say that date_recorded corresponds to the day you would have known what the cheapest price was, even though we acknowledge having this retrospective power in our hands. This column will be useful for one of my plots below… stay tuned!

I’ll also make a subset of the data that only contains the observations that aren’t the same as the preceding or following day’s lowest price, which will be useful for adding markers to our line plots.

# thanks to
# https://stackoverflow.com/questions/37610056/how-to-treat-nas-like-values-when-comparing-elementwise-in-r
# for the `%!=na%` function, to compare numbers to NA's

`%!=na%` <- function(e1, e2) {
  e1 != e2 | (is.na(e1) & !is.na(e2)) | (is.na(e2) & !is.na(e1)) & !(is.na(e1) & is.na(e2))
}

# discard "middle" prices
nomids <- d %>%
  filter(date_recorded <= '2024-02-23') %>%
  group_by(flight_date) %>%
  mutate(prev = lag(lowest_price), follow = lead(lowest_price)) %>%
  mutate(keep = (prev %!=na% lowest_price | follow %!=na% lowest_price)) %>%
  filter(keep) %>%
  ungroup()

Plotting Time

Flight Prices vs. Time Before Departure

Oof, this is a lot of code. I’m going to try to make this as interactive as possible, so I’m using the plotly package that enables a text bubble to appear when you hover your cursor over the lines. Here, I plot the lowest price for each day against the number of days before the departure date that it was recorded. I’m also going to include a few purple horizontal lines ($120, $140, and $160) to indicate the price range that I’m willing to pay for a one-way ticket.

Hovering over the lines/markers will show how many days before departure date and the cheapest price before recorded.

As of February 23, we are about 6.5 weeks before departure. The last substantial drop in prices for the other two flights seems to happen around 5.5 weeks… so I’m hoping to see a similar trend for the April 8th flight.

Flight Prices vs. Date Recorded

I wanted to also make a plot of prices against the date they were observed.

Weirdly enough, it looks like March 25 and April 8 have been following the same exact price trend for the last seven days, even though these flights are two weeks apart. What does this forbode for next week’s prices?

Here is the same information using a different type of interactive feature: a vertical line that follows the cursor. This is useful for comparing the prices that were recorded on the same day for each of the flights.

Conclusion

As of Feburary 23, 45 days (or 6.5 weeks) before April 8, I am hoping that the cheapest price will make one final drop sometime next week. I hope this won’t come back to bite me in the butt, but I shall wait until then to buy the ticket.

[3/5/24] Update

11 days later… I have reappeared from the shadows to report that the prices have indeed dropped! Well, price (singular) would be more accurate, since the cheapest nonstop price on March 18 and March 25 jumped by about $80 today, while that of April 8 dropped by $80.

It’s been a tense week and a half waiting on April 8’s price to drop, especially on February 25, when all other prices dropped to a reasonable price EXCEPT for April 8. I was so puzzled as to why it didn’t drop, and I wondered if I too blindly trusted this extremely tiny dataset of two preceding Mondays’ three months worth of price history… but I’m glad I waited, because it not only dropped but dropped to a price I’d be willing to pay.

In the following plot, I chopped off the top of the y-axis to make the numbers easier to read, but for the record, after Feburary 26, the price for March 18 jumped to $305, and today (March 5), the price for March 25 jumped to $305:

I was really expecting a price drop for April 8 about 5.5 weeks before the departure date, but it ended up happening under 5 weeks from the departure date, as seen in the updated plot below (green). It’s not good for the heart to rely on price drops within 5 weeks or less before the departure date, but it seemed to work out this time. (Phew!) Below, the y-axis is chopped off once again:

Lady Luck seemed to finally shine on me today. Or could this have been foreseen earlier? Whichever the case, I’m happy.

Special thanks…

…to the following forum posts for teaching me lots of things about plotly and ggplot2 that I didn’t know before: